Abstract: When a sample frequency table is published, disclosure risk arises 
when some individuals can be identified on the basis of their values in certain 
attributes in the table, called key variables; their values in other 
attributes may then be inferred, and their privacy is violated. 

On the basis of the sample to be released, and possibly some partial knowledge 
of the whole population, an agency that considers releasing the sample 
has to estimate the disclosure risk. 

Risk arises from non-empty sample cells which represent small population 
cells and from population uniques in particular. Therefore risk estimation re- 
quires assessing how many of the relevant population cells are likely to be small. 
Various methods have been proposed for this task, and we present a method 
in which estimation of a population cell frequency is based on smoothing using 
a local neighborhood of this cell, that is, cells having similar or close values in 
all attributes. 

We provide some preliminary results and experiments with this method. 
Comparisons are made to two other methods: (1) a log-linear models approach, 
in which inference on a given cell is based on a "neighborhood" of cells 
determined by the log-linear model; such neighborhoods have one or more 
attributes in common with the cell in question, but other attributes may differ 
significantly; (2) the Argus method, in which inference on a given cell is based 
only on the sample frequency in the specific cell, on the sample design and on 
some known marginal distributions of the population, without learning from 
any type of "neighborhood" of the given cell, nor from any model which uses 
the structure of the table. 



1. Introduction 

When a microdata sample file is released by an agency, directly identifying variables, 
such as name, address, etc., are always deleted, variable values are often grouped 
(e.g., Age-Groups instead of precise age), and the data is given in the form of a 
frequency table. However disclosure risk may still exist, that is, some individuals in 
the file may be identified by their combination of values in the variables appearing 
in the data. 

Samples often contain information on certain variables on which the agency's 
information for the whole population is limited, such as expenditure on specific 
items in a Household Expenditure Survey, or detailed information on variables such 
as children's extracurricular activities in the Social Survey of the Israel Central 
Bureau of Statistics. 

Often agencies have to assess the disclosure risk involved in the release of sample 
data in the form of a frequency table when the corresponding population table may 
be unknown, or only partially known. Risk arises from cells in which both sample 
and population frequencies are small, allowing an intruder who has the sample data 
and access to some information on the population, and in particular on individuals 
of interest, to identify such individuals in the sample with high probability. Thus, 
the disclosure risk depends both on the given sample, and the population. In this 
paper we are concerned with the issue of estimating disclosure risk involved in 
releasing a sample on the basis of the sample alone, assuming the population is 
unknown. 

Let $f = \{f_k\}$ denote an $m$-way frequency table, which is a sample from a 
population table $F = \{F_k\}$, where $k = (k_1, \ldots, k_m)$ indicates a cell and 
$f_k$ and $F_k$ denote the frequency in the sample and population cell $k$, respectively. 
Formally, the sample and population sizes in our models are random and their 
expectations are denoted by $n$ and $N$ respectively, and the number of cells by $K$. 
We can either assume that $n$ and $N$ are known, or that they are estimated by their 
natural estimators: the actual sample and population sizes, assumed to be known. 
In the sequel when we write $n$ or $N$ we formally refer to expectations. 

If the $m$ attributes in the table can be considered key variables, that is, variables 
which are to some extent accessible to the public or to potential intruders, then 
disclosure risk arises from cells in which both $f_k$ and $F_k$ are positive and small, 
and in particular when $f_k = F_k = 1$ (sample and population uniques). Suppose an 
intruder locates a sample unique in cell $k$, say, and is aware of the fact that the 
combination of values $k = (k_1, \ldots, k_m)$ happens to be unique or rare in the 
population. If this combination matches an individual of interest to the intruder then 
identification can be made with high probability on the basis of the $m$ attributes. 
If the sample contains information on the values of other attributes, then these 
can now be inferred for the individual in question, whose privacy is thereby violated. 
In many countries this would constitute a violation of law. For example, the Central 
Bureau of Statistics in Israel operates under the Statistics Ordinance (1972), which 
says "No information ... shall be so [published] as to enable the identification of the 
person to whom it relates". 

A global risk measure quantifies an aspect of the total risk in the file by aggre- 
gating risk over the individual cells. For simplicity we shall focus here only on two 
global measures, which are based on sample uniques: 

$\tau_1 = \sum_k I(f_k = 1, F_k = 1), \qquad \tau_2 = \sum_k I(f_k = 1)\,\frac{1}{F_k},$

where $I$ denotes the indicator function. Note that $\tau_1$ counts the number of sample 
uniques which are also population uniques, and $\tau_2$ is the expected number of correct 
guesses if each sample unique is matched to a randomly chosen individual from 
the same population cell. These measures are somewhat arbitrary, and one could 
consider measures which reflect matching of individuals that are not sample uniques, 
possibly with some restrictions on cell sizes. Also, it may make sense to normalize 
these measures by some measure of the total size of the table, by the number of 
sample uniques, or by some measure of the information value of the data. 
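
When both the sample and the population tables are available (as in the experiments later in the paper), the two global measures can be computed directly from their definitions. The following is a minimal sketch; the function name and the toy counts are illustrative assumptions, not taken from the paper.

```python
def true_risk_measures(sample_counts, population_counts):
    """Return (tau1, tau2) for paired cell counts f_k and F_k."""
    tau1 = 0
    tau2 = 0.0
    for f_k, F_k in zip(sample_counts, population_counts):
        if f_k == 1:                    # only sample uniques contribute
            tau1 += int(F_k == 1)       # sample unique that is also a population unique
            tau2 += 1.0 / F_k           # chance of a correct match within the cell
    return tau1, tau2


# Toy table with five cells (made-up numbers, for illustration only).
f = [1, 0, 1, 2, 1]
F = [1, 3, 4, 10, 2]
print(true_risk_measures(f, F))
```

Cells with $f_k \neq 1$ are skipped, so the sparse part of the table that contains no sample uniques has no effect on either measure.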

Various individual and global risk measures have been proposed in the literature, 
see, e.g., Benedetti et al., Skinner and Holmes, Elamir and Skinner, and 
Rinott. 

In Section 3 we propose and explain a new method of estimation of quantities 
like $\tau_1$ and $\tau_2$, using a standard Poisson model, and local smoothing of frequency 
tables. The method is based on the idea that one can learn about a given population 
cell from neighboring cells, if a suitable definition of closeness is possible, without 
relying on complex modeling. In Sections 2.1 and 2.2 we briefly describe two known 
methods of estimation of quantities like $\tau_1$ and $\tau_2$, and in Section 4 we provide real 
data experiments which compare the methods discussed. 

We consider the case that $f$ is known, and $F$ is an unknown parameter (on which 
there may be some partial information) and the quantities $\tau_1$ and $\tau_2$ should be 
estimated. Note that they are not proper parameters, since they involve both the 
sample $f$ and the parameter $F$. 

The methods discussed in this paper consist of modeling the conditional 
distribution of $F \mid f$, estimating parameters in this distribution and then using 
estimates of the form 

(1)  $\hat\tau_1 = \sum_k I(f_k = 1)\,\hat P(F_k = 1 \mid f_k = 1), \qquad \hat\tau_2 = \sum_k I(f_k = 1)\,\hat E\left[\frac{1}{F_k} \,\Big|\, f_k = 1\right],$

where $\hat P$ and $\hat E$ denote estimates of the relevant conditional probability and 
expectation. For a general theory of estimates of this type see Zhang and references 
therein. Some direct variance estimates appear in Rinott. 



2. Models 



For completeness we briefly introduce the Poisson and Negative Binomial models. 
More details can be found, for example, in Bethlehem et al., Cameron and 
Trivedi, and Rinott. 

A common assumption in the frequency table literature is $F_k \sim \mathrm{Poisson}(N\gamma_k)$, 
independently, where $N$ is assumed to be a known parameter, and $\sum_k \gamma_k = 1$. 
Binomial (or Poisson) sampling from $F_k$ means that $f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$, where 
each $\pi_k$ is a known constant which is part of the sampling design, called the sampling 
fraction in cell $k$. By standard calculations we then have 

(2)  $f_k \sim \mathrm{Poisson}(N\gamma_k\pi_k)$ and $F_k \mid f_k \sim f_k + \mathrm{Poisson}(N\gamma_k(1 - \pi_k))$, 

leading to the Poisson model of Section 2.1 below. 
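
The "standard calculations" behind (2) can be spelled out in one step. The sketch below uses the shorthand $\mu = N\gamma_k$ and $\pi = \pi_k$ (introduced here for brevity, not notation from the paper) and factors the joint probability of the sample and population counts in a single cell:

```latex
\begin{align*}
P(F_k = F,\; f_k = f)
  &= e^{-\mu}\frac{\mu^{F}}{F!}\binom{F}{f}\pi^{f}(1-\pi)^{F-f} \\
  &= \underbrace{e^{-\mu\pi}\frac{(\mu\pi)^{f}}{f!}}_{\mathrm{Poisson}(\mu\pi)\ \text{at}\ f}
     \;\cdot\;
     \underbrace{e^{-\mu(1-\pi)}\frac{\{\mu(1-\pi)\}^{F-f}}{(F-f)!}}_{\mathrm{Poisson}(\mu(1-\pi))\ \text{at}\ F-f},
  \qquad F \ge f .
\end{align*}
```

The two factors show that $f_k$ and $F_k - f_k$ are independent Poisson variables with means $N\gamma_k\pi_k$ and $N\gamma_k(1-\pi_k)$, which gives both parts of (2).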

Under this model the population size is random with expectation $N$, and so is 
the sample size, with expectation $N\sum_k \gamma_k\pi_k$, which we denote by $n$. In practice 
we have in mind that $N$ and $n$ could be estimated by the actual population and 
sample sizes, and these estimates could be "plugged in" where needed. 

If one adds the Bayesian assumption $\gamma_k \sim \mathrm{Gamma}(\alpha, \beta)$ independently, with 
$\alpha\beta = 1/K$ to ensure that $E\sum_k \gamma_k = 1$, then $f_k \sim NB(\alpha, p_k = \frac{1}{1 + N\pi_k\beta})$, the 
Negative Binomial distribution defined for any $\alpha > 0$ by 

$P(f_k = x) = \frac{\Gamma(x + \alpha)}{\Gamma(x + 1)\,\Gamma(\alpha)}\,(1 - p_k)^{x}\,p_k^{\alpha}, \qquad x = 0, 1, 2, \ldots,$

which for a natural $\alpha$ counts the number of failures until $\alpha$ successes occur in 
independent Bernoulli trials with probability of success $p_k$. Further calculations 
yield $F_k \mid f_k \sim f_k + NB(\alpha + f_k, \rho_k = \frac{N\pi_k + 1/\beta}{N + 1/\beta})$ $(F_k \ge f_k)$. Note that in this model the 
population size is again random with expectation $N$, and now the sample size has 
expectation $N\sum_k \pi_k/K$, which we denote again by $n$. 

As $\alpha \to 0$ (and hence $\beta \to \infty$) we obtain $F_k \mid f_k \sim f_k + NB(f_k, \pi_k)$, which is 
exactly the Negative Binomial assumption in Section 2.2 below. As $\alpha \to \infty$ the 
Poisson model of Section 2.1 is obtained, and in this sense the Negative Binomial 
with parameter $\alpha$ subsumes both models. 

Next we discuss two methods which have received much attention. They have 
been applied in some bureaus of statistics recently, and are being tested by others. 



2.1. The Poisson log-linear method 

Skinner and Holmes and Elamir and Skinner proposed and studied the 
following approach. Assuming a fixed sampling fraction, that is, $\pi_k = \pi$, the first 
part of (2) implies $f_k \sim \mathrm{Poisson}(n\gamma_k)$, where $n = N\pi$. Using the sample $\{f_k\}$ 
one can fit a log-linear model using standard programs, and obtain estimates $\{\hat\gamma_k\}$ 
of the parameters. Goodness of fit measures for selecting models having good risk 
estimates were studied in Skinner and Shlomo [11]. 

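Using the second part of (2) it is easy to compute individual risk measures for 
cell $k$, defined by 

(3)  $P(F_k = 1 \mid f_k = 1) = e^{-N\gamma_k(1-\pi_k)}, \qquad E\left[\frac{1}{F_k} \,\Big|\, f_k = 1\right] = \frac{1}{N\gamma_k(1-\pi_k)}\left[1 - e^{-N\gamma_k(1-\pi_k)}\right].$

Plugging $\hat\gamma_k$ for $\gamma_k$ in (3) leads to the desired estimates $\hat P(F_k = 1 \mid f_k = 1)$ and 
$\hat E[\frac{1}{F_k} \mid f_k = 1]$ and then to $\hat\tau_1$ and $\hat\tau_2$ of (1). 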
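
For readers who want the intermediate step, the conditional expectation of $1/F_k$ under the Poisson model follows from a short series computation. The sketch below writes $\mu = N\gamma_k(1-\pi_k)$ and $X = F_k - 1$ (shorthand introduced here), so that $X \mid f_k = 1 \sim \mathrm{Poisson}(\mu)$:

```latex
\begin{align*}
P(F_k = 1 \mid f_k = 1) &= P(X = 0) = e^{-\mu}, \\[4pt]
E\!\left[\frac{1}{F_k} \,\Big|\, f_k = 1\right]
  &= \sum_{x \ge 0} \frac{1}{x+1}\, e^{-\mu}\frac{\mu^{x}}{x!}
   = \frac{e^{-\mu}}{\mu} \sum_{x \ge 0} \frac{\mu^{x+1}}{(x+1)!}
   = \frac{e^{-\mu}}{\mu}\left(e^{\mu} - 1\right)
   = \frac{1 - e^{-\mu}}{\mu}.
\end{align*}
```

As $\mu \to 0$ the expectation tends to 1 (the sample unique is almost surely a population unique), and as $\mu$ grows it decays like $1/\mu$.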

For each $k$ we therefore obtain estimates of $P(F_k = 1 \mid f_k = 1)$ and $E[\frac{1}{F_k} \mid f_k = 1]$ 
which depend on $\hat\gamma_k$, which in turn depends on the frequencies in other cells. For 
example, in a log-linear model of independence, $\hat\gamma_k$ depends on the frequencies in 
all cells which have a common attribute with $k$. Thus cells that are rather different 
in nature, having values which are very different from those of cell $k$ in most of the 
attributes, influence the estimates of the parameter $\gamma_k$ pertaining to this cell. 

The main goal of this paper is to study the possibility of estimating $\gamma_k$ using cells 
in more local "neighborhoods," having attribute values which are closer to those of 
the cell $k$ in cases where closeness can be defined. 



2.2. The Argus method 

This method, proposed by Benedetti et al., was originally oriented towards 
individual risk estimation, but was subsequently also applied to global risk measures, 
see, e.g., Polettini and Seri and Rinott. Argus has recently been implemented 
in some European statistical bureaus. 

In the Argus model it is assumed that $F_k \mid f_k \sim f_k + NB(f_k, \pi_k)$ with an implicit 
assumption of independence between cells. Since the $\pi_k$ are assumed known we could 
now calculate $P_{\pi_k}(F_k = 1 \mid f_k = 1)$ and $E_{\pi_k}[\frac{1}{F_k} \mid f_k = 1]$. However, because of 
non-response, sampling biases and errors, Argus does not use the known $\pi_k$, but rather 
estimates them from the sampling weights as discussed next. 

At statistics bureaus, each statistical unit responding to a sample survey is 
assigned a sampling weight. This weight $w_i$ is an inflation factor that indicates 
the number of units in the population that are represented by sample unit $i$, to be 
used for inference from the sample to the population. It is calculated as the inverse 
sampling fraction, adjusted for non-response or other biases that may occur 
in the sampling process. These adjustments are often carried out within post-strata 
(weighting classes) defined by known marginal distributions of the population, 
such as Age, Sex and Geographical Location. The inverse sampling fractions are 
calibrated so that the weighted sample count in each post-stratum is equal to the 
known population total; this calibration reduces under- or over-representation of 
the chosen strata due to any bias or sampling errors. 

The Argus method provides initial estimates of the population cell sizes of the 
form $\hat F_k = \sum_{i \in \text{cell } k} w_i$, where $w_i$ denotes the sampling weight of individual $i$ 
described above (see also the example below). 

Here is a simple example: 

Suppose for simplicity that the sampling weights are based only on the sampling 
design, and on post-stratification by a single variable, say Sex, and that the sample 
is designed to be a random subset consisting of one percent of the population, so 
that we have the same sampling fraction $\pi = 1/100$ in each Sex group. If 
males, say, have a non-response rate of 20%, and females of 0%, then the sampling 
weight for women in the sample would be $w_i = 100$, and for men $w_i = 100/0.8 = 125$. 

If in the sample table there is a cell $k = (k_1, k_2)$ where $k_1$ stands for Male, and 
$k_2$ stands for the level in another attribute, such as Income, and $f_k = 20$, then in 
this cell all $w_i$ are 125, and $\hat F_k = 20 \times 125 = 2500$. 

Now suppose Sex is not one of the variables in the table to be released, but 
the agency knows it for all individuals in the sample. Suppose the variables in 
the table are Income and Occupation, and suppose now $k = (k_1, k_2)$, where $k_1$ 
stands for a given Income group, and $k_2$ for a given Occupation. Suppose $f_k = 20$, 
meaning that in the sample there are 20 individuals with the given income group 
and occupation, and suppose that there are 10 males and 10 females in this group. 
The weight is $w_i = 100$ for the 10 females, and 125 for the 10 males, and therefore 
$\hat F_k = 10 \times 100 + 10 \times 125 = 2250$. 
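
The arithmetic of this example can be reproduced in a few lines. This is only an illustrative sketch: the helper names are hypothetical, and the weight is taken to be the inverse of (sampling fraction times response rate), as in the example above. Exact fractions are used so that the weights come out as whole numbers.

```python
from fractions import Fraction


def sampling_weight(sampling_fraction, response_rate):
    """Inverse sampling fraction, inflated for non-response."""
    return 1 / (sampling_fraction * response_rate)


def argus_initial_estimate(weights):
    """Initial estimate of a population cell size: sum of the weights in the cell."""
    return sum(weights)


pi = Fraction(1, 100)
w_female = sampling_weight(pi, Fraction(1))      # no non-response
w_male = sampling_weight(pi, Fraction(4, 5))     # 20% non-response

# Cell containing 20 sampled males.
print(argus_initial_estimate([w_male] * 20))
# Cell containing 10 sampled females and 10 sampled males.
print(argus_initial_estimate([w_female] * 10 + [w_male] * 10))
```

The two printed values correspond to the estimates 2500 and 2250 computed in the text.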

In the above example sampling weights reflect non response. In principle a bureau 
may arrive at such weights also because in the original sampling design men are 
under represented, or because it finds out that this is the case after post stratifying 
on Sex and observing that males are under represented due to some reasons (some 
bias, including non-response, or sampling error). 

Returning to Argus, recall its initial estimates of the population cell sizes 
$\hat F_k = \sum_{i \in \text{cell } k} w_i$. Using the relation $E_{\pi_k}[F_k \mid f_k] = f_k/\pi_k$, the parameters $\pi_k$ are 
estimated using the moment-type estimate $\hat\pi_k = f_k/\hat F_k$. Note that if $F_k$ were known, 
this would be the usual estimate of the binomial sampling probability. 

Straightforward calculations with the Negative Binomial distribution show 

$P_{\pi_k}(F_k = 1 \mid f_k = 1) = \pi_k$ and $E_{\pi_k}\left[\frac{1}{F_k} \,\Big|\, f_k = 1\right] = -\frac{\pi_k}{1 - \pi_k}\log(\pi_k)$. 

Plugging $\hat\pi_k$ into these expressions to obtain $\hat P$ and $\hat E$ in (1), we obtain the estimates 
$\hat\tau_1$ and $\hat\tau_2$ of the global risk measures. Note that in this method the cells are treated 
completely independently, one cell at a time, and the structure of the table, or relations 
between different cells, play no role. Moreover, since this method does not involve a 
model which reduces the number of parameters, it is required to estimate essentially 
$K$ parameters, which is typically hard in sparse tables of the kind we have in mind. 
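
Putting the pieces of this subsection together, the Argus-type global estimates can be sketched as follows. This is an illustration under the formulas stated above, not the Argus software itself; the function name and the toy inputs are assumptions, and it is assumed that $\hat F_k \ge f_k$ (weights of at least 1).

```python
import math


def argus_risk_estimates(sample_counts, estimated_population_counts):
    """Sum the Negative Binomial risk formulas over the sample uniques.

    estimated_population_counts holds the weight-based estimates F-hat_k.
    """
    tau1 = 0.0
    tau2 = 0.0
    for f_k, F_hat in zip(sample_counts, estimated_population_counts):
        if f_k != 1:
            continue                      # only sample uniques contribute
        pi_hat = f_k / F_hat              # moment-type estimate of pi_k
        tau1 += pi_hat                    # P(F_k = 1 | f_k = 1)
        if pi_hat < 1:
            tau2 += -pi_hat / (1 - pi_hat) * math.log(pi_hat)
        else:
            tau2 += 1.0                   # limit of the expression as pi_hat -> 1
    return tau1, tau2


# Two sample uniques with weights 100 and 125, plus one non-unique cell.
print(argus_risk_estimates([1, 1, 20], [100, 125, 2500]))
```

Each cell's contribution depends only on its own weight-based estimate, which is the sense in which the method ignores the structure of the table.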

3. Smoothing polynomials and local neighborhoods 

The estimation question here is essentially the following: given, say, a sample unique, 
how likely is it to be also a population unique, or to arise from a small population cell? 
If a sample unique is found in a part of the sample table where neighboring cells 
(by some reasonable metric, to be discussed later) are small or empty, then it seems 
reasonable to believe that it is more likely to have arisen from a small population 
cell. This motivates our attempt to study local neighborhoods, and to compare the 
results to the model-driven neighborhoods of the log-linear method, and to the 
Argus method, which uses no neighborhoods. 

Consider frequency tables in which some of the attributes are ordinal, and define 
closeness between categories of an attribute in terms of the order, or more generally, 
suppose that for a certain attribute one can say that some values of the attribute 
are closer to a given value than others. For example, Age and Years of Education 
are ordinal attributes, and naturally the age of 5 is closer to 6 than to 7 or 17, say, 
while Occupation is not ordinal, but one can try to define reasonable notions of 
closeness between different occupations. 

Classical log-linear models do not take such closeness into account, and therefore, 
when such models are used for individual cell parameter estimation, the estimates 
involve data in cells which may be rather remote from the estimated cell. 

On the other hand, as mentioned above, the Argus method bases its estimation 
only on the sampling weight of the estimated population cell. There is no learning 
from other cells, the structure of the table plays no role, and each cell's parameter 
is estimated separately. 

We now describe our proposed approach which consists of using local neighbor- 
hoods of the estimated cell. 

Returning to (2) we assume that $f_k \sim \mathrm{Poisson}(\lambda_k = N\gamma_k\pi_k)$. Apart from 
constants, the sample log-likelihood is $\sum_{k=1}^{K} [f_k \log \lambda_k - \lambda_k]$. However, if we use a 
model for $\lambda_k$ which is valid only in some neighborhood $M$ of a given cell, we shall 
consider the log-likelihood of the data in this neighborhood, that is 

(4)  $\sum_{k \in M} [f_k \log \lambda_k - \lambda_k]$. 

For convenience of notation we now assume that $m = 2$, that is, we consider two-way 
tables; the extension to any $m$ is straightforward. Following Simonoff, see also 
references therein, we use a local smoothing polynomial model. 

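For each fixed $k = (k_1, k_2)$ separately, we write the model below for $\lambda_{k'}$ in terms 
of the parameters $\alpha = (\beta_0, \beta_1, \gamma_1, \ldots, \beta_t, \gamma_t)$, with $k' = (k'_1, k'_2)$ varying in some 
neighborhood of $k$: 

(5)  $\log \lambda_{k'}(\alpha) = \log \lambda_{(k'_1, k'_2)}(\alpha) = \beta_0 + \beta_1(k'_1 - k_1) + \gamma_1(k'_2 - k_2) + \cdots + \beta_t(k'_1 - k_1)^t + \gamma_t(k'_2 - k_2)^t,$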

for some natural number $t$. One can hope that such a polynomial model is valid with 
a suitable $t$ for $k' = (k'_1, k'_2)$ in some neighborhood $M$ of $k = (k_1, k_2)$. Substituting 
(5) into (4) we maximize the concave function 

(6)  $L(\alpha) = L(\beta_0, \beta_1, \gamma_1, \ldots, \beta_t, \gamma_t) = \sum_{(k'_1, k'_2) \in M} \left[ f_{(k'_1, k'_2)} \log \lambda_{(k'_1, k'_2)}(\alpha) - \lambda_{(k'_1, k'_2)}(\alpha) \right]$

with respect to the coefficients in $\alpha$ of the regression model (5). With $\arg\max 
L(\alpha) = \hat\alpha$, and $\hat\beta_0$ denoting its first component, we finally obtain our estimate of 
$\lambda_k = \lambda_{(k_1, k_2)}$ in the form 

(7)  $\hat\lambda_k = \lambda_k(\hat\alpha) = \exp(\hat\beta_0)$, 

where the second equality is explained by taking $k' = k = (k_1, k_2)$ in (5). The 
maximization by the Newton-Raphson method is rather straightforward and fast. 

Each of the estimates $\hat\lambda_k$ requires a separate maximization as above, which leads 
to a value $\hat\alpha$ that depends on $k = (k_1, k_2)$, and a set of estimates $\lambda_{k'}(\hat\alpha)$, of which 
only $\hat\lambda_k$ of (7) is used. For the risk measures discussed in this paper, it suffices to 
compute these estimates for cells $k$ which are sample uniques, that is, $f_k = 1$. 
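
The local fit described above can be sketched in plain Python for a two-way table. This is a minimal illustration and not the authors' SAS implementation: the names `solve` and `local_lambda` are hypothetical, categories are assumed to be coded as consecutive integers, cells missing from the table are counted as zero, and the Newton-Raphson iteration is run without any step-size safeguard, so it may fail when the neighborhood is too sparse for the maximum to exist.

```python
import math
from itertools import product


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for j in range(col, n + 1):
                M[r][j] -= factor * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def local_lambda(freq, k, c=3, t=2, iters=50, tol=1e-10):
    """Fit the degree-t local model around cell k; return (exp(beta_0), coefficients).

    freq maps (k1, k2) -> sample count; absent cells count as zero.
    """
    offsets = list(product(range(-c, c + 1), repeat=2))
    # Design row: (1, d1, d2, d1^2, d2^2, ..., d1^t, d2^t).
    X = [[1.0] + [d ** p for p in range(1, t + 1) for d in (d1, d2)]
         for d1, d2 in offsets]
    y = [freq.get((k[0] + d1, k[1] + d2), 0) for d1, d2 in offsets]
    if sum(y) <= 0:
        raise ValueError("neighborhood is empty")
    p = 1 + 2 * t
    coef = [math.log(sum(y) / len(y))] + [0.0] * (2 * t)   # start at a flat surface
    for _ in range(iters):
        lam = [math.exp(sum(b * x for b, x in zip(coef, row))) for row in X]
        # Score vector and (negative) Hessian of the Poisson log-likelihood.
        grad = [sum(row[j] * (y_i - l_i) for row, y_i, l_i in zip(X, y, lam))
                for j in range(p)]
        info = [[sum(l_i * row[i] * row[j] for row, l_i in zip(X, lam))
                 for j in range(p)] for i in range(p)]
        step = solve(info, grad)
        coef = [b + s for b, s in zip(coef, step)]
        if max(abs(s) for s in step) < tol:
            break
    return math.exp(coef[0]), coef


# Synthetic smooth surface around cell (10, 20), used only to exercise the code.
surface = {(10 + d1, 20 + d2): math.exp(0.5 + 0.2 * d1 - 0.1 * d2)
           for d1 in range(-3, 4) for d2 in range(-3, 4)}
lam_hat, coef = local_lambda(surface, (10, 20), c=3, t=1)
print(lam_hat, coef)
```

Because the offsets are measured from the cell $k$ itself, the intercept of the fitted polynomial is the log of the estimate at $k$, which is why only $\exp(\hat\beta_0)$ is returned as the cell estimate.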

Equating the partial derivative of the function of (6) with respect to $\beta_0$ to zero we 
obtain $\sum_{k' \in M} \lambda_{k'}(\hat\alpha) = \sum_{k' \in M} f_{k'}$, and the other derivatives yield moment identities. 
Note, however, that these desirable identities hold for $\lambda_{k'}(\hat\alpha)$ which are obtained 
for a fixed $k = (k_1, k_2)$, and not for our final estimates in (7), which are the ones 
we use in the sequel. 

With the estimate of (7), recalling $\lambda_k = N\gamma_k\pi_k$ and setting $U = \{k : f_k = 1\}$, 
the set of sample uniques, we now apply the Poisson formulas (3), see also (1), to 
obtain the risk estimates 

(8)  $\hat\tau_1 = \sum_{k \in U} e^{-\hat\lambda_k(1-\pi_k)/\pi_k}, \qquad \hat\tau_2 = \sum_{k \in U} \frac{1}{\hat\lambda_k(1-\pi_k)/\pi_k}\left[1 - e^{-\hat\lambda_k(1-\pi_k)/\pi_k}\right].$

In our experiments we defined neighborhoods M of k by varying around k co- 
ordinates corresponding to attributes that are ordinal, and using close values in 
non-ordinal attributes when possible (e.g., in Occupation). Attributes in which 
closeness of values cannot be defined remain constant in the whole neighborhood. 
Thus in our experiments, neighborhoods always consist of individuals of the same 
Sex. For more details see Section 4. 
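
Once the local estimates for the sample uniques are available, the global estimates are a direct sum of the Poisson formulas. The sketch below assumes a constant sampling fraction, as in the experiments; the function name and the input values are illustrative assumptions.

```python
import math


def smoothing_risk_estimates(lambda_hats, pi):
    """Global risk estimates from local estimates of lambda_k.

    lambda_hats holds the estimates for the sample uniques only;
    pi is the (constant) sampling fraction.
    """
    tau1 = 0.0
    tau2 = 0.0
    for lam in lambda_hats:
        mu = lam * (1 - pi) / pi          # estimated mean of F_k - f_k
        tau1 += math.exp(-mu)             # P(F_k = 1 | f_k = 1)
        # E[1/F_k | f_k = 1]; expm1 keeps accuracy when mu is tiny.
        tau2 += -math.expm1(-mu) / mu if mu > 0 else 1.0
    return tau1, tau2


# Two sample uniques with made-up local estimates, 10% sampling fraction.
print(smoothing_risk_estimates([0.1, 0.5], pi=0.1))
```

A sample unique whose neighborhood is nearly empty gets a small estimate of $\lambda_k$ and therefore a contribution close to 1 in both sums.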

4. Experiments with neighborhoods 

We present a few experiments. They are preliminary as already mentioned and more 
work is needed on the approach itself and on classifying types of data for which it 
might work. 

In the experiments we used our own versions of the Argus and log-linear mod- 
els methods, programmed on the SAS system. Throughout our experiments two 
log-linear models are considered, one of independence of all attributes, the other 
including all two-way interactions. 

The weights Wi for the Argus method in all our examples were computed by 
post-stratification on Sex by Age by Geographical location (the latter is not one of 
the attributes in any of the tables, but it was used for post-stratification). These 
variables are commonly used for post-stratification; other strata may give different, 
and perhaps better, results. 

In all experiments we took a real population data file of size $N$ given in the form 
of a contingency table with $K$ cells, and from it we took a simple random sample of 
size $n$. Since the population and the sample are known to us, we can compute the 
true values of $\tau_1$ and $\tau_2$ and their estimates by the different methods, and compare. 
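
The sampling step of this experimental design can be sketched as follows. The function name, the seed argument and the toy population are assumptions made for illustration; the population table is expanded into one label per individual, which is practical only for moderate population sizes.

```python
import random
from collections import Counter


def draw_sample_table(population_counts, n, seed=0):
    """Simple random sample (without replacement) of n individuals.

    population_counts maps a cell label to F_k; the result maps cells to f_k.
    """
    rng = random.Random(seed)
    # One entry per individual, labelled by the cell it belongs to.
    individuals = [k for k, F_k in population_counts.items() for _ in range(F_k)]
    return Counter(rng.sample(individuals, n))


# Toy population of 10 individuals in three cells.
population = {("young", "low"): 5, ("young", "high"): 1, ("old", "low"): 4}
sample = draw_sample_table(population, n=4)
print(dict(sample))
```

With both tables in hand, the true risk measures follow from the definitions of $\tau_1$ and $\tau_2$, and can be set against the estimates that use the sample alone.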

Example 1. In this small example the population consists of a small extract from 
the 1995 Israeli Census with individuals of age 15 and over, with $N = 15{,}035$ and 
$K = 448$. From this population we took a random sample of size $n = 1{,}504$, using a 
fixed sampling fraction, that is, $\pi_k = n/N$ for all $k$. The sampling fraction is constant 
in all our experiments. The attributes (with number of levels in parentheses) were 
Age Groups (32) and Income Groups (14), both ordinal. 

As mentioned above, throughout our experiments two log-linear models are 
considered, one of independence, the other including all two-way interactions (which, 
in the present case of two attributes, is a saturated model). In this experiment we 
tried our proposed smoothing polynomial approach of (5) for $t = 1, 2$. We 
considered one type of neighborhood here, constructed by changing each attribute value 
in $k$ by at most 3 values up or down, that is, the neighborhood of each cell $k$ is 

(9)  $M = \{k' : \max_{1 \le i \le m} |k'_i - k_i| \le c\}$, 

with $m = 2$, $c = 3$ and hence size $|M| = 49$. For cells near the boundaries some 
of the cells in their neighborhoods do not exist; here we set non-existing cells' 
frequencies to be zero, but other possibilities can be considered. 

Table 1 presents the true $\tau$ values and their estimates by the methods described 
above. 

Table 1

                                          Example 1              Example 2
Model                                     $\tau_1$   $\tau_2$    $\tau_1$   $\tau_2$
True Values                               2          12.4        2          19.9
Argus                                     7.8        19.6        14.7       37.2
Log Linear Model: Independence            0.06       6.7         0.01       9.8
Log Linear Model: 2-Way Interactions      0.01       8.6         1.4        19.6
Smoothing $t = 1$, $|M| = 49$             3.2        12.0        7.0        22.5
Smoothing $t = 2$, $|M| = 49$             1.7        10.4        4.8        19.0

Example 2. The population consists of an extract from the 1995 Israeli Census, 
$N = 37{,}586$, $n = 3{,}759$, and $K = 896$. The attributes are Sex (2) * Age Groups 
(32) * Income Groups (14). 

We applied the smoothing polynomial of (5) for $t = 1, 2$ and neighborhoods 
obtained by varying the attributes of Age and Income as in Example 1 and keeping 
Sex fixed. In other words we used the neighborhoods 

(10)  $M = \{k' : k'_1 = k_1,\ \max_{2 \le i \le m} |k'_i - k_i| \le c\}$, 

with $m = 3$, $c = 3$, which are like (9) on each sub-table of males and females. The 
results are given in Table 1. 

Example 3. Population: an extract from the 1995 Israeli Census. $N = 37{,}586$, 
$n = 3{,}759$, $K = 11{,}648$. Attributes: Sex (2) * Age Groups (32) * Income Groups (14) 
* Years of Study (13). 

We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods 
obtained by fixing Sex, so neighborhoods are as in (10), but with $m = 4$, $c = 2$, 
and since we now vary three variables, each over a range of five values, we have 
$|M| = 125$. The results are given in Table 2. 

Table 2

Model                                     $\tau_1$   $\tau_2$
True Values                               187        452.0
Argus                                     137.2      346.4
Log Linear Model: Independence            217.3      518.0
Log Linear Model: 2-Way Interactions      167.2      432.8
Smoothing $t = 2$, $|M| = 125$            170.7      447.9

Table 3

Model                                     $\tau_1$   $\tau_2$
True Values                               191        568.0
Argus                                     79.2       315.6
Log Linear Model: Independence            364.8      862.3
Log Linear Model: 2-Way Interactions      182.2      546.2
Smoothing $t = 2$, $|M| = 545$            139.6      509.1
Smoothing $t = 2$, $|M| = 625$            154.7      528.5
Smoothing $t = 2$, $|M| = 1025$           215.7      647.2

Table 4

Model                                     $\tau_1$   $\tau_2$
True Values                               5          36.9
Argus                                     7.7        35.5
Log Linear Model: Independence            6.4        44.2
Log Linear Model: 2-Way Interactions      1.1        26.4
Smoothing $t = 2$, $|M| = 125$            3.3        31.3

Example 4. Population: an extract from the 2001 UK Census File. $N = 944{,}793$, 
$n = 18{,}896$, $K = 152{,}100$. Attributes: Sex (2) * Age Groups (25) * Number of 
Persons in Household (9) * Education Qualifications (13) * Occupation (26). 

We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods defined 
by fixing Sex and varying all other variables, including Occupation, which was coded 
as ordinal. The neighborhoods are 

(11)  $M = \{k' : k'_1 = k_1,\ \max_{2 \le i \le m} |k'_i - k_i| \le c,\ \sum_{i=2}^{m} |k'_i - k_i| \le d\}$, 

with $m = 5$, $c = 2$ and $d = 6, 8$, resulting in neighborhood sizes $|M| = 545$ and 
625, respectively. We also tried $c = 3$, $d = 6$ and hence $|M| = 1025$. The results 
are given in Table 3. 
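
The neighborhood sizes quoted in the examples can be checked by enumerating the offsets $k' - k$ over the attributes that are allowed to vary. The sketch below is illustrative; the function name is hypothetical, boundary effects are ignored (all neighboring cells are assumed to exist), and the fixed attribute (Sex) is simply left out of the enumeration.

```python
from itertools import product


def neighborhood_offsets(n_varying, c, d=None):
    """Offsets with max |offset| <= c and, if d is given, sum |offset| <= d."""
    return [o for o in product(range(-c, c + 1), repeat=n_varying)
            if d is None or sum(abs(x) for x in o) <= d]


print(len(neighborhood_offsets(2, 3)))         # two varying attributes, c = 3: 49
print(len(neighborhood_offsets(3, 2)))         # three varying attributes, c = 2: 125
print(len(neighborhood_offsets(4, 2, d=6)))    # four varying attributes: 545
print(len(neighborhood_offsets(4, 2, d=8)))    # d = 8 is not binding here: 625
print(len(neighborhood_offsets(4, 3, d=6)))    # c = 3, d = 6: 1025
```

The additional constraint on the sum of absolute offsets trims the corners of the cube, which lets the range $c$ grow without the neighborhood size growing as fast as $(2c + 1)^{m-1}$.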

Example 5. Population: an extract from the 1995 Israeli Census. $N = 248{,}983$, 
$n = 2{,}490$, $K = 8{,}800$. Attributes: Sex (2) * Age Groups (16) * Years of Study (25) 
* Occupation (11). 

We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods 
obtained by varying three attributes and fixing Sex, so neighborhoods are as in (10) with 
$m = 4$, $c = 2$, and $|M| = 125$. The results are given in Table 4. 

Example 6. Population: an extract from the 1995 Israeli Census. $N = 746{,}949$, 
$n = 14{,}939$, $K = 337{,}920$. Attributes: Sex (2) * Age Groups (16) * Years of Study 
(10) * Number of Years in Israel (11) * Income Groups (12) * Number of Persons 
in Household (8). Note that this is a very sparse table. 

We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods 
obtained by varying all attributes except for Sex, which was fixed. Neighborhoods are 
as in (11) with $m = 6$, $c = 2$, $d = 4, 6$, and $|M| = 581$ and $1{,}893$, respectively. The 
results are given in Table 5. 






Y. Rinott and N. Shlomo 



Table 5

Model                                     $\tau_1$   $\tau_2$
True Values                               430        1,125.8
Argus                                     114.5      456.0
Log Linear Model: Independence            773.8      1,774.1
Log Linear Model: 2-Way Interactions      470.0      1,178.1
Smoothing $t = 2$, $|M| = 581$            287.1      988.4
Smoothing $t = 2$, $|M| = 1{,}893$        471.1      1,240.2


Table 6

Model                                     $\tau_1$   $\tau_2$
True Values                               42         171.2
Argus                                     20.7       95.4
Log Linear Model: Independence            28.8       191.5
Log Linear Model: 2-Way Interactions      35.8       164.1
Smoothing $t = 2$, $|M| = 545$            37.1       175.1



Example 7. Population: an extract from the 1995 Israeli Census. $N = 746{,}949$, 
$n = 7{,}470$, $K = 42{,}240$. Attributes: Sex (2) * Age Groups (16) * Years of Study 
(10) * Number of Years in Israel (11) * Income Groups (12). 

We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods 
obtained by varying all attributes except for Sex, which was fixed. Neighborhoods are 
as in (11) with $m = 5$, $c = 2$, $d = 6$, and $|M| = 545$. Smaller neighborhoods did not 
yield good estimates. The results are given in Table 6. 

Discussion of examples. The log-linear model method was tested in Skinner and 
Shlomo [11] and references therein, and it seems to yield good results for 
experiments of the kind done here. Di Consiglio et al. [5] presented experiments for 
individual risk assessment with Argus, which seems to perform less well than the 
log-linear method in many of our experiments with global risk measures. Our new 
method still requires fine-tuning. At present the results seem comparable to the 
log-linear method, and it seems to be computationally somewhat simpler and faster. 

Naturally, more variables and sparse data sets with a large number of cells are 
typical and need to be tested. Such files will cause difficulties for any method, and 
this is where the different methods should be compared. In sparse multi-way tables, 
model selection will be crucial but difficult for the log-linear method, and perhaps 
simpler for the smoothing approach. We also think that our method may be easier 
to adapt to complex sampling designs. 

Our proposed method is at a preliminary stage and requires more work. Partic- 
ular directions are the following: 

1. Adjust the estimates $\hat\gamma_k$ obtained from (7) to fit known population marginals obtained from 
prior knowledge and sampling weights. In log-linear models the total sum of these 
estimates corresponds to the sample size, but as commented in Section 3 this is not 
the case with the smoothing estimates of (7). 

2. Use goodness of fit measures and information on population marginals and 
sampling weights to select the type and size of the neighborhoods, and the degree of the 
smoothing polynomial in (5). We have observed in experiments that when the sum 
of all estimates matches the sample size, we obtain good risk measure estimates, 
and further matching to marginals may improve the estimates. 

3. Extend the smoothing approach to the more general Negative Binomial model 
which subsumes both the Poisson model implemented here, and the Negative 
Binomial discussed in Section 2. 

4. Apply this method also for individual risk measure estimates, which are im- 
portant in themselves, and may also shed more light on efficient neighborhood and 
model selection. Our preliminary experiments suggest that the smoothing approach 
performs relatively well in estimating individual risk. 